# ik_llama.cpp

## Docs

- [Building from source](https://mintlify.wiki/ikawrakow/ik_llama.cpp/building.md): Build ik_llama.cpp for CPU, CUDA, Metal, ROCm, and other backends
- [Android deployment](https://mintlify.wiki/ikawrakow/ik_llama.cpp/deployment/android.md): Run ik_llama.cpp on Android using Termux or the Android NDK
- [Docker deployment](https://mintlify.wiki/ikawrakow/ik_llama.cpp/deployment/docker.md): Run ik_llama.cpp in a Docker or Podman container
- [Performance troubleshooting](https://mintlify.wiki/ikawrakow/ik_llama.cpp/deployment/performance-tips.md): Diagnose and improve token generation speed
- [FlashMLA](https://mintlify.wiki/ikawrakow/ik_llama.cpp/features/flash-mla.md): Optimized Multi-Head Latent Attention for DeepSeek models on CPU and CUDA
- [Function calling](https://mintlify.wiki/ikawrakow/ik_llama.cpp/features/function-calling.md): Use OpenAI-style tool calling with any model via Jinja templates
- [Multimodal (vision)](https://mintlify.wiki/ikawrakow/ik_llama.cpp/features/multimodal.md): Run vision-language models with image input using llama-mtmd-cli and llama-server
- [Speculative decoding](https://mintlify.wiki/ikawrakow/ik_llama.cpp/features/speculative-decoding.md): Accelerate token generation with draft models, n-gram caches, and ngram-mod
- [GPU offloading](https://mintlify.wiki/ikawrakow/ik_llama.cpp/inference/gpu-offload.md): Configure GPU offloading to maximize inference performance with CUDA
- [Hybrid CPU/GPU inference](https://mintlify.wiki/ikawrakow/ik_llama.cpp/inference/hybrid-cpu-gpu.md): Run large models that don't fit in VRAM using hybrid RAM and VRAM offloading
- [Parameters reference](https://mintlify.wiki/ikawrakow/ik_llama.cpp/inference/parameters.md): Complete reference for ik_llama.cpp command-line parameters
- [Running the server](https://mintlify.wiki/ikawrakow/ik_llama.cpp/inference/server.md): Start llama-server for OpenAI-compatible LLM inference with a built-in WebUI
- [Introduction](https://mintlify.wiki/ikawrakow/ik_llama.cpp/introduction.md): What is ik_llama.cpp and how does it differ from llama.cpp?
- [Importance matrix (imatrix)](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quantization/imatrix.md): Generate and use importance matrices to improve quantization quality
- [IQK quantization types](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quantization/iqk-quants.md): State-of-the-art IQK quantization formats exclusive to ik_llama.cpp
- [Quantization overview](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quantization/overview.md): Understanding quantization types in ik_llama.cpp: IQK, Trellis, legacy k-quants, and more
- [Trellis quantization](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quantization/trellis-quants.md): IQ1_KT, IQ2_KT, IQ3_KT, IQ4_KT: novel integer trellis-based quantization for extreme compression
- [Quickstart](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quickstart.md): Get ik_llama.cpp running in minutes on CPU or GPU
- [Build options](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/build-options.md): CMake flags and environment variables for building ik_llama.cpp
- [llama-server reference](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/cli-server.md): CLI flags for the llama-server inference server
- [CLI tools reference](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/cli-tools.md): llama-cli, llama-quantize, llama-imatrix, llama-bench, llama-sweep-bench
- [Model formats and conversion](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/model-formats.md): GGUF format, model splits, and HuggingFace conversion
- [Supported models](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/supported-models.md): Model families supported by ik_llama.cpp